8 research outputs found
Automatic grammar induction from free text using insights from cognitive grammar
Automatic identification of the grammatical structure of a sentence is useful in many Natural Language
Processing (NLP) applications such as Document Summarisation, Question Answering systems and
Machine Translation. With the availability of syntactic treebanks, supervised parsers have been
developed successfully for many major languages. However, developing such parsers for minority
languages with few digital resources is considerably more challenging. Moreover, the existing syntactic
annotation schemes are motivated by different linguistic theories and formalisms, are sometimes
language-specific, and cannot always be adapted for developing syntactic parsers across different
language families.
This project aims to develop a linguistically motivated approach to the automatic induction of
grammatical structures from raw sentences. Such an approach can be readily adapted to different
languages including low-resourced minority languages. We draw the basic approach to linguistic analysis
from usage-based, functional theories of grammar, such as Cognitive Grammar and Computational
Paninian Grammar, and from insights from psycholinguistic studies. Our approach identifies the
grammatical structure of a sentence by recognising domain-independent, general cognitive patterns of
conceptual organisation that occur in natural language. It also reflects some general psycholinguistic
properties of human parsing, such as incrementality, connectedness and expectation.
Our implementation has three components: Schema Definition, Schema Assembly and Schema
Prediction. The Schema Definition and Schema Assembly components were implemented algorithmically,
as a dictionary and a set of rules, while an Artificial Neural Network was trained for Schema
Prediction. Using part-of-speech (POS) tags to bootstrap the simplest case of token-level schema
definitions, a sentence is passed through all three components incrementally until all the words are
exhausted and the entire sentence is analysed as an instance of one final construction schema. The
order in which the intermediate schemas are assembled to form the final schema can be viewed as the
parse of the sentence. Parsers for English and Welsh (a low-resource minority language) were developed
using the same approach, with some changes to the Schema Definition component. We evaluated parser
performance in four ways: (a) quantitative evaluation, comparing the parsed chunks against the
constituents in a phrase-structure tree; (b) manual evaluation, listing the range of linguistic
constructions covered by the parser and performing error analysis on the parser outputs; (c) counting
the number of edits required for a correct assembly; and (d) qualitative evaluation based on Likert
scales in online surveys.
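As an illustration only, the incremental Schema Definition and Schema Assembly loop described above might be sketched as follows. The schema names, POS-to-schema dictionary and assembly rules here are invented toy stand-ins, and the real system additionally uses a trained neural network for Schema Prediction:

```python
# Toy sketch of the incremental Schema Definition -> Schema Assembly loop.
# All schema names and rules below are invented stand-ins.

# Schema Definition: bootstrap token-level schemas from POS tags.
SCHEMA_DEFS = {
    "DET": "thing-specifier",
    "NOUN": "thing",
    "VERB": "process",
    "ADJ": "property",
}

# Schema Assembly: rules combining two adjacent schemas into one.
ASSEMBLY_RULES = {
    ("thing-specifier", "thing"): "thing",
    ("property", "thing"): "thing",
    ("thing", "process"): "clause",
    ("clause", "thing"): "clause",
}

def parse(pos_tags):
    """Consume POS tags incrementally, assembling schemas left to right.

    Returns the assembly trace (the 'parse') and the final schema stack.
    """
    stack, trace = [], []
    for tag in pos_tags:
        stack.append(SCHEMA_DEFS[tag])  # token-level schema
        # Greedily assemble the two topmost schemas while a rule applies.
        while len(stack) >= 2 and tuple(stack[-2:]) in ASSEMBLY_RULES:
            left, right = stack[-2], stack[-1]
            combined = ASSEMBLY_RULES[(left, right)]
            trace.append((left, right, combined))
            stack[-2:] = [combined]
    return trace, stack

# "The cat chased a mouse" -> DET NOUN VERB DET NOUN
trace, final = parse(["DET", "NOUN", "VERB", "DET", "NOUN"])
```

Here the whole sentence ends up analysed as a single final schema, and `trace` records the order in which intermediate schemas were assembled, i.e. the parse.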
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
Understanding the sentiment of a comment from a video or an image is an
essential task in many applications. Sentiment analysis of a text can be useful
for various decision-making processes. One such application is to analyse the
popular sentiments of videos on social media based on viewer comments. However,
comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts.
The lack of annotated code-mixed data for a low-resourced language like
Tamil adds further difficulty. To overcome this, we created a gold
standard Tamil-English code-switched, sentiment-annotated corpus containing
15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator
agreement and show the results of sentiment analysis trained on this corpus as
a benchmark.
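The benchmark results mentioned above come from sentiment models trained on the corpus. As a self-contained illustration only, not the paper's actual models, a minimal bag-of-words Naive Bayes baseline could look like this; the toy code-mixed comments and labels are invented for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Train a bag-of-words multinomial Naive Bayes sentiment model.

    `examples` is a list of (token_list, label) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    priors = {c: math.log(k / total) for c, k in class_counts.items()}
    return priors, word_counts, vocab

def predict(model, tokens):
    """Return the most probable label under the trained model."""
    priors, word_counts, vocab = model
    best, best_score = None, -math.inf
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)  # Laplace smoothing
        score = priors[c] + sum(
            math.log((word_counts[c][w] + 1) / denom) for w in tokens
        )
        if score > best_score:
            best, best_score = c, score
    return best

# Invented toy code-mixed comments, for illustration only.
data = [
    ("semma mass movie".split(), "positive"),
    ("super movie vera level".split(), "positive"),
    ("waste movie worst".split(), "negative"),
    ("mokka padam waste".split(), "negative"),
]
model = train_nb(data)
```

Real benchmarks on noisy code-mixed text would of course need a proper tokeniser and a much larger training set.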
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
This paper describes the development of a multilingual, manually annotated
dataset for three under-resourced Dravidian languages generated from social
media comments. The dataset was annotated for sentiment analysis and offensive
language identification for a total of more than 60,000 YouTube comments. The
dataset consists of around 44,000 comments in Tamil-English, around 7,000
comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and shows high
inter-annotator agreement, as measured by Krippendorff's alpha. The dataset contains all
types of code-mixing phenomena since it comprises user-generated content from a
multilingual country. We also present baseline experiments to establish
benchmarks on the dataset using machine learning methods. The dataset is
available on GitHub
(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo
(https://zenodo.org/record/4750858).
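Krippendorff's alpha, the agreement measure reported for this dataset, can be sketched for nominal labels with complete annotations as follows. This is a simplified illustration; vetted implementations (e.g. NLTK's `nltk.metrics.agreement`) are preferable in practice:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels with no missing values.

    `units` is a list of tuples, one tuple of labels per annotated item
    (one label per annotator).
    """
    o = Counter()  # coincidence counts o[(c, k)]
    for labels in units:
        m = len(labels)
        for c, k in permutations(labels, 2):  # ordered annotator pairs
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())  # total pairable values
    disagreement = sum(v for (c, k), v in o.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c, k in permutations(n_c, 2))
    if expected == 0:  # only one label ever used: no possible disagreement
        return 1.0
    return 1 - (n - 1) * disagreement / expected

# Two annotators labelling four comments:
alpha = krippendorff_alpha_nominal(
    [("pos", "pos"), ("pos", "neg"), ("neg", "neg"), ("neg", "neg")]
)
```

Alpha corrects observed disagreement by the disagreement expected from the pooled label distribution, so it remains comparable across label sets and numbers of annotators.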
Corpus creation for sentiment analysis in code-mixed Tamil-English text
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis
of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos
on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a
low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English
code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on
this corpus as a benchmark.
This publication has emanated from research supported in part by a research grant from Science
Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight) and SFI/12/RC/2289 P2
(Insight 2), co-funded by the European Regional Development Fund, as well as by the EU H2020
programme under grant agreements 731015 (ELEXIS, European Lexical Infrastructure) and 825182
(Prêt-à-LLOD), and by Irish Research Council grant IRCLA/2017/129 (CARDAMOM, Comparative Deep
Models of Language for Minority and Historical Languages). Non-peer-reviewed.